Welcome to Apache Spark with R

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

In this notebook we will introduce basic concepts of SparkSQL with R, as described in the SparkR documentation, applied to the example people dataset. We will do two things: read the data into a SparkSQL DataFrame, and take a quick look at the schema of what we have read.


Creating a SparkSQL context and loading data

In [1]:
# Load SparkR and start a Spark session with a custom application name
library(SparkR)
sc <- sparkR.session(sparkConfig = list(spark.app.name = "R Spark Test"))


Attaching package: 'SparkR'

The following objects are masked from 'package:stats':

    cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from 'package:base':

    as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
    rank, rbind, sample, startsWith, subset, summary, transform, union

Launching java with spark-submit command /usr/local/spark/bin/spark-submit   sparkr-shell /tmp/Rtmp3mWazA/backend_port9e113cc0dc 
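
Note that attaching SparkR masks several functions from stats and base, as the messages above show. The masked originals remain reachable under their namespaces; a minimal sketch, not part of the original run:

# After library(SparkR), filter() refers to SparkR::filter();
# the masked stats version can still be called explicitly
stats::filter(c(1, 2, 3, 4, 5), rep(1 / 3, 3))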

In [3]:
# Read the JSON file into a SparkSQL DataFrame; the schema is inferred from the data
people <- read.df("/opt/datasets/people.json", "json")

In [4]:
printSchema(people)


root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
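
The same schema information can also be pulled into plain R structures; dtypes() and columns() are SparkR functions, and this sketch was not part of the original run:

# Column name/type pairs as an R list
dtypes(people)
# Just the column names
columns(people)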

In [5]:
head(people)


  age    name
1  NA Michael
2  30    Andy
3  19  Justin

In [6]:
# Keep only the rows where age is greater than 19
head(filter(people, people$age > 19))


  age name
1  30 Andy
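
As a quick sketch of the SQL side of SparkSQL (not part of the original run), the same filter can be written as a SQL query over a temporary view; the view name "people" is just an illustrative choice:

# Register the DataFrame as a temporary view so it can be queried with SQL
createOrReplaceTempView(people, "people")

# Equivalent of filter(people, people$age > 19)
adults <- sql("SELECT name, age FROM people WHERE age > 19")
head(adults)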